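Nowadays, cameras equipped with AI systems can capture and analyze images to detect people automatically. However, an AI system can make mistakes when it receives deliberately designed patterns in the real world, i.e., physical adversarial examples. Prior work has shown that adversarial patches can be printed on clothes to evade DNN-based person detectors. However, the attack success rate of these adversarial examples can drop catastrophically when the viewing angle (i.e., the camera's angle towards the object) changes. To perform a multi-angle attack, we propose Adversarial Texture (AdvTexture). AdvTexture can cover clothes of arbitrary shape, so that people wearing such clothes can hide from person detectors at different viewing angles. We propose a generative method, named Toroidal-Cropping-based Expandable Generative Attack (TC-EGA), to craft AdvTexture with repetitive structures. We printed several pieces of cloth with AdvTexture and then made T-shirts, skirts, and dresses in the physical world. Experiments show that these clothes can fool person detectors in the physical world.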
translated by Google Translate
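The repetitive structure mentioned in the AdvTexture abstract can be illustrated with a wraparound crop. The sketch below is only an assumption about what "toroidal cropping" means in general (indices wrap around both edges of a tile); the function names `toroidal_crop` and `tile_texture` are hypothetical and are not taken from the paper or its code.

```python
def toroidal_crop(tile, top, left, height, width):
    """Crop a height x width window from `tile`, wrapping around both edges."""
    rows, cols = len(tile), len(tile[0])
    return [
        [tile[(top + r) % rows][(left + c) % cols] for c in range(width)]
        for r in range(height)
    ]


def tile_texture(tile, reps_y, reps_x):
    """Repeat `tile` reps_y times vertically and reps_x times horizontally."""
    return [list(row) * reps_x for _ in range(reps_y) for row in tile]


# A tiny 2 x 3 "texture" tile with distinct values per pixel.
tile = [[0, 1, 2],
        [3, 4, 5]]

# A wraparound crop of the tile equals an ordinary crop of the tiled texture.
wrapped = toroidal_crop(tile, top=1, left=2, height=2, width=3)
tiled = tile_texture(tile, reps_y=3, reps_x=3)
plain = [row[2:5] for row in tiled[1:3]]
print(wrapped == plain)
```

The point of the comparison is that any window of an endlessly repeated tile can be obtained from the single tile, which is why a texture built this way can be extended to cloth of arbitrary size.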
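Two-stage detectors have gained much popularity in 3D object detection. Most two-stage 3D detectors use grid points, voxel grids, or sampled keypoints for RoI feature extraction in the second stage. Such methods, however, are inefficient at handling unevenly distributed and sparse outdoor points. This paper addresses the problem in three aspects. 1) Dynamic Point Aggregation. We propose patch search to quickly find the points in a local region for each 3D proposal. Farthest voxel sampling is then applied to sample the points evenly. In particular, the voxel size varies with distance to accommodate the uneven distribution of points. 2) RoI-graph Pooling. We build local graphs on the sampled points to better model contextual information and mine point relations through iterative message passing. 3) Visual Features Augmentation. We introduce a simple yet effective fusion strategy to compensate for sparse LiDAR points with limited semantic cues. Based on these modules, we construct Graph R-CNN as the second stage, which can be applied to existing one-stage detectors to consistently improve detection performance. Extensive experiments show that Graph R-CNN outperforms state-of-the-art 3D detection models by a large margin on both KITTI and the Waymo Open Dataset. We rank first on the KITTI BEV car detection leaderboard. Code will be available at https://github.com/nightmare-n/graphrcnn.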
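The sampling idea in the Graph R-CNN abstract (voxels that grow with distance, followed by farthest-point selection) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the ring-based size schedule, the parameters `base_size` and `ring_width`, and both function names are assumptions made for the example.

```python
import math


def adaptive_voxel_representatives(points, base_size=0.2, ring_width=5.0):
    """Keep one point per voxel, with voxels that grow with horizontal range.

    Assumption: the range is split into rings of `ring_width` metres and the
    voxel edge in ring n is base_size * (1 + n).
    """
    representatives = {}
    for point in points:
        x, y, z = point
        ring = int(math.hypot(x, y) // ring_width)
        size = base_size * (1 + ring)
        key = (ring, math.floor(x / size), math.floor(y / size), math.floor(z / size))
        representatives.setdefault(key, point)  # first point in a voxel wins
    return list(representatives.values())


def farthest_point_sampling(points, k):
    """Greedily pick up to k points, each farthest from those already chosen."""
    chosen = [0]
    dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        idx = max(range(len(points)), key=dist.__getitem__)
        chosen.append(idx)
        dist = [min(d, math.dist(p, points[idx])) for d, p in zip(dist, points)]
    return [points[i] for i in chosen]


cloud = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0),
         (10.3, 0.0, 0.0), (10.7, 0.0, 0.0),
         (20.0, 0.0, 0.0)]
reps = adaptive_voxel_representatives(cloud)
print(farthest_point_sampling(reps, 2))
```

With a fixed 0.2 m voxel the two points near 10 m would fall into different voxels; the larger far-range voxel merges them, which is the effect the distance-dependent size is meant to achieve.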
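One-to-one matching is a key design that gives DETR its end-to-end capability, so that object detection does not require a hand-crafted NMS (non-maximum suppression) step to remove duplicate detections. This end-to-end signature is important for the versatility of DETR, and it has been generalized to a wide range of vision problems, including instance/semantic segmentation, human pose estimation, and point-cloud/multi-view-image-based detection. However, we note that because too few queries are assigned as positive samples, one-to-one matching significantly reduces the training efficiency of positive samples. This paper proposes a simple yet effective method based on a hybrid matching scheme that combines the original one-to-one matching branch with auxiliary queries that use a one-to-many matching loss during training. This hybrid strategy is shown to significantly improve training efficiency and accuracy. At inference, only the original one-to-one matching branch is used, which preserves the end-to-end merit and the inference efficiency of DETR. The method is named $\mathcal{H}$-DETR, and it consistently improves a wide range of representative DETR methods across a wide range of vision tasks, including Deformable-DETR, 3DETR/PETRv2, PETR, and TransTrack, among others. Code will be available at: https://github.com/hdetr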
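The difference between the two matching branches in the hybrid scheme can be illustrated on a toy cost matrix. The sketch below uses brute-force search instead of the Hungarian algorithm so that it needs only the standard library, and it realizes one-to-many matching by repeating each ground-truth box `k` times; both choices, and the function names, are assumptions for illustration rather than the method's actual code.

```python
from itertools import permutations


def one_to_one_match(cost):
    """Return {gt_index: query_index} minimizing total cost (brute force).

    cost[q][g] is the cost of assigning query q to ground truth g.
    Only suitable for tiny examples.
    """
    num_q, num_g = len(cost), len(cost[0])
    best = min(
        permutations(range(num_q), num_g),
        key=lambda qs: sum(cost[q][g] for g, q in enumerate(qs)),
    )
    return dict(enumerate(best))


def one_to_many_match(cost, k):
    """Match each ground truth to k queries by repeating it k times."""
    num_g = len(cost[0])
    repeated = [[row[g % num_g] for g in range(num_g * k)] for row in cost]
    matches = {g: [] for g in range(num_g)}
    for rep_g, q in one_to_one_match(repeated).items():
        matches[rep_g % num_g].append(q)
    return matches


# Four queries, two ground-truth boxes.
cost = [[0.1, 0.9],
        [0.2, 0.8],
        [0.9, 0.1],
        [0.7, 0.3]]
print(one_to_one_match(cost))      # one positive query per box
print(one_to_many_match(cost, 2))  # two positive queries per box
```

In a training loop the two assignments would feed two loss terms, with the one-to-many term applied only to the auxiliary queries and dropped at inference, as the abstract describes.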
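Monocular 3D detection has drawn much attention from the community because of its low cost and simple setup. It takes an RGB image as input and predicts 3D boxes in 3D space. The most challenging sub-task is instance depth estimation. Previous works usually estimate it directly. In this paper, however, we point out that the instance depth on an RGB image is non-intuitive. It couples visual depth clues with instance attribute clues, which makes it hard to learn directly in the network. We therefore propose to reformulate the instance depth as the combination of the instance visual surface depth (visual depth) and the instance attribute depth (attribute depth). The visual depth is related to an object's appearance and position in the image. By contrast, the attribute depth relies on the object's inherent attributes, which are invariant to affine transformations of the object in the image. Correspondingly, we decouple the 3D location uncertainty into visual depth uncertainty and attribute depth uncertainty. By combining the different types of depth and their associated uncertainties, we obtain the final instance depth. Furthermore, data augmentation in monocular 3D detection is usually limited by the physical nature of the task, which hinders performance gains. The proposed instance depth disentanglement strategy alleviates this problem. Evaluated on KITTI, our method achieves new state-of-the-art results, and extensive ablation studies validate the effectiveness of each component of our method. The code is released at https://github.com/spengliang/did-m3d.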
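As a rough illustration of the decomposition described above, the snippet below combines a visual depth and an attribute depth into an instance depth. The abstract only says the two are "combined"; treating the combination as a sum, and merging the two uncertainties as independent standard deviations, are assumptions of this sketch, as is the function name.

```python
import math


def combine_instance_depth(visual_depth, attribute_depth,
                           visual_sigma, attribute_sigma):
    """Combine the two depth components and their uncertainties.

    Assumptions: the instance depth is the sum of the two components, and the
    two uncertainties are independent standard deviations, so they add in
    quadrature.
    """
    depth = visual_depth + attribute_depth
    sigma = math.hypot(visual_sigma, attribute_sigma)
    return depth, sigma


# Visible surface at 20.0 m, object centre a further 1.5 m behind it.
depth, sigma = combine_instance_depth(20.0, 1.5, 0.3, 0.4)
print(depth, sigma)
```

Keeping the two terms separate is what would allow an augmentation to change one of them (for example, the surface depth under image rescaling) while leaving the object's intrinsic offset untouched.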
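Current LiDAR-only 3D detection methods inevitably suffer from the sparsity of point clouds. Many multi-modal methods have been proposed to alleviate this issue, but the different representations of images and point clouds make them difficult to fuse, resulting in suboptimal performance. In this paper, we present a novel multi-modal framework, SFD (Sparse Fuse Dense), which uses pseudo point clouds generated by depth completion to tackle these issues. Unlike prior work, we propose a new RoI fusion strategy, 3D-GAF (3D Grid-wise Attentive Fusion), to make fuller use of the information from the different types of point clouds. Specifically, 3D-GAF fuses the 3D RoI features from the two point clouds in a grid-wise attentive way, which is more fine-grained and more precise. In addition, we propose SynAugment (Synchronized Augmentation), which enables our multi-modal framework to use all data augmentation approaches tailored to LiDAR-only methods. Finally, we customize an effective and efficient feature extractor, CPConv (Color Point Convolution), for pseudo point clouds. It explores the 2D image features and the 3D geometric features of pseudo point clouds simultaneously. Our method ranks highest on the KITTI car 3D object detection leaderboard, demonstrating the effectiveness of SFD. Code is available at https://github.com/littlepey/sfd.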
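The grid-wise attentive fusion described in the SFD abstract can be pictured as a per-grid weighted average of two feature vectors. The sketch below is a deliberately small stand-in: the scoring vector `score_weights` takes the place of whatever learned layers predict the attention weights, and the function name is hypothetical.

```python
import math


def grid_wise_fusion(raw_feats, pseudo_feats, score_weights):
    """Fuse two lists of per-grid feature vectors with a two-way softmax.

    Assumption: each grid's attention score is a dot product between its
    feature vector and a shared `score_weights` vector.
    """
    fused = []
    for raw, pseudo in zip(raw_feats, pseudo_feats):
        s_raw = sum(w * x for w, x in zip(score_weights, raw))
        s_pseudo = sum(w * x for w, x in zip(score_weights, pseudo))
        top = max(s_raw, s_pseudo)  # subtract the max for numerical stability
        e_raw, e_pseudo = math.exp(s_raw - top), math.exp(s_pseudo - top)
        a_raw = e_raw / (e_raw + e_pseudo)
        fused.append([a_raw * r + (1.0 - a_raw) * p for r, p in zip(raw, pseudo)])
    return fused


raw = [[2.0, 0.0]]     # feature of one RoI grid from the raw point cloud
pseudo = [[0.0, 4.0]]  # feature of the same grid from the pseudo point cloud
print(grid_wise_fusion(raw, pseudo, score_weights=[0.0, 0.0]))
```

Because the weights are computed separately for every grid cell, one cell can lean on the raw points while its neighbour leans on the pseudo points, which is the fine-grained behaviour the abstract contrasts with coarser fusion.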
Benefiting from its ability to exploit intrinsic supervision information, contrastive learning has recently achieved promising performance in deep graph clustering. However, we observe that two drawbacks of the positive and negative sample construction mechanisms prevent existing algorithms from improving further. 1) The quality of positive samples depends heavily on carefully designed data augmentations, while inappropriate data augmentations easily lead to semantic drift and indiscriminative positive samples. 2) The constructed negative samples are unreliable because they ignore important clustering information. To solve these problems, we propose a Cluster-guided Contrastive deep Graph Clustering network (CCGC) that mines the intrinsic supervision information in the high-confidence clustering results. Specifically, instead of conducting complex node or edge perturbation, we construct two views of the graph by designing special Siamese encoders whose weights are not shared between the sibling sub-networks. Then, guided by the high-confidence clustering information, we carefully select and construct the positive samples from the same high-confidence cluster in the two views. Moreover, to construct semantically meaningful negative sample pairs, we regard the centers of different high-confidence clusters as negative samples, thus improving the discriminative capability and reliability of the constructed sample pairs. Lastly, we design an objective function that pulls samples from the same cluster together and pushes those from other clusters apart by maximizing the cross-view cosine similarity between positive samples and minimizing it between negative samples. Extensive experimental results on six datasets demonstrate the effectiveness of CCGC compared with existing state-of-the-art algorithms.
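The objective described above (high cross-view cosine similarity inside a high-confidence cluster, low similarity between centers of different clusters) can be sketched with plain Python lists. The exact loss form below, mean of 1 - cos for positives plus mean cos for negative center pairs, is an assumption chosen for readability and is not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cluster_guided_loss(view1, view2, labels, confident):
    """Toy cluster-guided contrastive loss over high-confidence nodes.

    view1, view2: per-node embeddings from the two encoders.
    labels: cluster id per node; confident: indices of high-confidence nodes.
    """
    clusters = {}
    for i in confident:
        clusters.setdefault(labels[i], []).append(i)
    # Positives: cross-view pairs drawn from the same high-confidence cluster.
    pos = [1.0 - cosine(view1[i], view2[j])
           for members in clusters.values() for i in members for j in members]
    # Negatives: cross-view pairs of centers of different clusters.
    centers1 = {c: mean_vector([view1[i] for i in m]) for c, m in clusters.items()}
    centers2 = {c: mean_vector([view2[i] for i in m]) for c, m in clusters.items()}
    neg = [cosine(centers1[a], centers2[b])
           for a in clusters for b in clusters if a != b]
    return sum(pos) / len(pos) + (sum(neg) / len(neg) if neg else 0.0)


view1 = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
view2 = [[1.0, 0.05], [1.0, 0.0], [0.05, 1.0], [0.0, 1.0]]
nodes = [0, 1, 2, 3]
good = cluster_guided_loss(view1, view2, [0, 0, 1, 1], nodes)
bad = cluster_guided_loss(view1, view2, [0, 1, 0, 1], nodes)
print(good < bad)
```

The comparison at the end shows the intended behaviour: a cluster assignment that agrees with the embedding geometry yields a lower loss than one that mixes the two groups.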
To generate high-quality rendered images for real-time applications, it is common to trace only a few samples per pixel (spp) at a lower resolution and then supersample to the high resolution. Based on the observation that pixels rendered at a low resolution are typically highly aliased, we present a novel method for neural supersampling based on ray tracing 1/4-spp samples at the high resolution. Our key insight is that ray-traced samples at the target resolution are accurate and reliable, which turns supersampling into an interpolation problem. We present a mask-reinforced neural network to reconstruct and interpolate high-quality image sequences. First, a novel temporal accumulation network computes the correlation between current and previous features to significantly improve their temporal stability. Then, a reconstruction network based on a multi-scale U-Net with skip connections reconstructs and generates the desired high-resolution image. Experimental results and comparisons show that the proposed method produces higher-quality supersampling results than current state-of-the-art methods without increasing the total number of ray-traced samples.
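To make the 1/4-spp budget concrete, the sketch below builds a binary mask that marks one ray-traced pixel in every 2 x 2 block at the target resolution. The specific pattern, and the idea of rotating the sampled position from frame to frame, are assumptions for illustration; the abstract does not specify how the samples are placed.

```python
def quarter_spp_mask(height, width, frame_index=0):
    """Return a 0/1 mask with one sampled pixel per 2 x 2 block.

    Assumption: the sampled position inside each block cycles through the
    four possible offsets as `frame_index` increases.
    """
    offsets = [(0, 0), (0, 1), (1, 0), (1, 1)]
    oy, ox = offsets[frame_index % 4]
    return [
        [1 if (y % 2 == oy and x % 2 == ox) else 0 for x in range(width)]
        for y in range(height)
    ]


mask = quarter_spp_mask(4, 4, frame_index=0)
for row in mask:
    print(row)
```

Pixels where the mask is 1 carry exact ray-traced values, and the network only has to fill in the remaining three quarters, which is why the task can be treated as interpolation.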
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment in an untrimmed video given a sentence query. Existing works first use a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with the query sentence for reasoning. However, we argue that these methods overlook two important issues: 1) Boundary bias: the annotated target segment generally refers to two specific frames as the start and end timestamps. The video downsampling process may lose these two frames and take adjacent irrelevant frames as the new boundaries. 2) Reasoning bias: such incorrect new boundary frames also bias the reasoning during frame-query interaction, reducing the generalization ability of the model. To alleviate these limitations, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames that enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationships among these frames and to generate soft labels on the boundaries for more accurate frame-query reasoning. This mechanism can also supplement the sparsely sampled frames with the missing consecutive visual semantics for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
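The boundary bias described above is easy to reproduce numerically. The sketch below uniformly samples a fixed number of frames and snaps the annotated start and end frames to the nearest sampled ones; the sampling rule (centre of each equal-length chunk) and the function names are assumptions for illustration.

```python
def uniform_sample(num_frames, num_samples):
    """Pick the centre frame of each of `num_samples` equal-length chunks."""
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def snapped_boundaries(sampled, start, end):
    """Map annotated boundaries to the nearest frames that survived sampling."""
    def nearest(target):
        return min(sampled, key=lambda frame: abs(frame - target))
    return nearest(start), nearest(end)


sampled = uniform_sample(num_frames=100, num_samples=5)
new_start, new_end = snapped_boundaries(sampled, start=37, end=62)
print(sampled, new_start, new_end)
```

In this toy case the annotated segment is frames 37 to 62, but neither frame is sampled, so the model would be trained against boundaries that sit several frames away from the true ones.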
Representing and synthesizing novel views in real-world dynamic scenes from casual monocular videos is a long-standing problem. Existing solutions typically approach dynamic scenes by applying geometry techniques or utilizing temporal information between several adjacent frames without considering the underlying background distribution in the entire scene or the transmittance over the ray dimension, limiting their performance on static and occlusion areas. Our approach $\textbf{D}$istribution-$\textbf{D}$riven neural radiance fields offers high-quality view synthesis and a 3D solution to $\textbf{D}$etach the background from the entire $\textbf{D}$ynamic scene, which is called $\text{D}^4$NeRF. Specifically, it employs a neural representation to capture the scene distribution in the static background and a 6D-input NeRF to represent dynamic objects, respectively. Each ray sample is given an additional occlusion weight to indicate the transmittance lying in the static and dynamic components. We evaluate $\text{D}^4$NeRF on public dynamic scenes and our urban driving scenes acquired from an autonomous-driving dataset. Extensive experiments demonstrate that our approach outperforms previous methods in rendering texture details and motion areas while also producing a clean static background. Our code will be released at https://github.com/Luciferbobo/D4NeRF.
Deploying reliable deep learning techniques in interdisciplinary applications requires learned models to output accurate and (even more importantly) explainable predictions. Existing approaches typically explicate network outputs in a post-hoc fashion, under the implicit assumption that faithful explanations come from accurate predictions/classifications. We make the opposite claim: explanations boost (or even determine) classification. That is, end-to-end learning of explanation factors to augment discriminative representation extraction could be a more intuitive strategy to inversely assure fine-grained explainability, e.g., in neuroimaging and neuroscience studies with high-dimensional data containing noisy, redundant, and task-irrelevant information. In this paper, we propose such an explainable geometric deep network, dubbed NeuroExplainer, with applications to uncovering altered infant cortical development patterns associated with preterm birth. Given fundamental cortical attributes as network input, NeuroExplainer adopts a hierarchical attention-decoding framework to learn fine-grained attentions and the respective discriminative representations in order to accurately distinguish preterm infants from term-born infants at term-equivalent age. NeuroExplainer learns the hierarchical attention-decoding modules under subject-level weak supervision coupled with targeted regularizers deduced from domain knowledge regarding brain development. These prior-guided constraints implicitly maximize the explainability metrics (i.e., fidelity, sparsity, and stability) during network training, driving the learned network to output detailed explanations and accurate classifications. Experimental results on the public dHCP benchmark suggest that NeuroExplainer leads to quantitatively reliable explanation results that are qualitatively consistent with representative neuroimaging studies.